Abstract

The f-divergences are an important tool for quantifying the likeness of two
probability measures and have seen widespread application in statistics
and information theory. An example is the total variation, which plays
an exceptional role among the f-divergences. It is shown that every
f-divergence is bounded from below by a function of the total variation.
Under appropriate regularity conditions, this function is shown to be
strictly monotonic.

Remark: The proof of the main proposition is relatively easy, whence 
it is highly likely that the result is known. The author would be very 
grateful for any information regarding references or related work. 



1 The total variation 

Let $(\Omega, \sigma)$ be a measurable space. A signed measure $\nu$ is a
$\sigma$-additive set function with values in
$\mathbb{R} \cup \{-\infty, \infty\}$, and so that either $\nu > -\infty$ or
$\nu < \infty$. I will use the standard term measure if $\nu$ is nonnegative.
To any signed measure $\nu$, there corresponds a Hahn-Jordan decomposition of
$\Omega$ into two measurable sets $P, N$ so that $P \cup N = \Omega$,
$P \cap N = \emptyset$ and

$$\nu^+(\cdot) = \nu(\cdot \cap P), \qquad \nu^-(\cdot) = -\nu(\cdot \cap N) \tag{1}$$

are both (nonnegative) measures. Obviously, $\nu = \nu^+ - \nu^-$. Furthermore,
the representation

$$\nu^+(A) = \sup_{B \subset A} \nu(B), \qquad \nu^-(A) = -\inf_{B \subset A} \nu(B) \tag{2}$$

holds for every measurable set $A$. For a proof of these facts see [2]. The
measure $\langle\nu\rangle = \nu^+ + \nu^-$ is called the variation measure of
$\nu$, which in turn defines the total variation
$\|\nu\| = \langle\nu\rangle(\Omega)$. If $\nu(\Omega) = 0$, it follows easily
from the previous statements that

$$\langle\nu\rangle(\Omega) = 2 \sup_{B \in \sigma} |\nu(B)|. \tag{3}$$

A probability measure is a measure $\mu$ so that $\mu(\Omega) = 1$. For any two
probability measures $\mu, \nu$, the difference $\mu - \nu$ is a signed measure
with $(\mu - \nu)(\Omega) = 0$, and Equation (3) applies. Hence,

$$\|\mu - \nu\| = \langle\mu - \nu\rangle(\Omega) = 2 \sup_{B \in \sigma} |\mu(B) - \nu(B)|. \tag{4}$$

Obviously, $\|\mu - \nu\|$ is a metric for probability measures, namely the
total variation metric, with Equation (4) providing two possible
representations. If $\mu$ is absolutely continuous with respect to $\nu$, then
there is a third representation, namely

$$\|\mu - \nu\| = \int \left| \frac{d\mu}{d\nu} - 1 \right| d\nu. \tag{5}$$

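Proof of this fact: the sets $P = \{\frac{d\mu}{d\nu} \geq 1\}$ and
$N = \{\frac{d\mu}{d\nu} < 1\}$ form a Hahn-Jordan decomposition for
$\mu - \nu$, so that
$\langle\mu - \nu\rangle(\Omega) = \int_P (\frac{d\mu}{d\nu} - 1)\, d\nu + \int_N (1 - \frac{d\mu}{d\nu})\, d\nu = \int |\frac{d\mu}{d\nu} - 1|\, d\nu$.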
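For measures on a finite set, the three representations of the total variation can be compared numerically. The following is a minimal sketch under that assumption; the function names `tv_sup`, `tv_variation` and `tv_density` and the example vectors are illustrative and not part of the original note.

```python
from itertools import combinations

def tv_sup(p, q):
    """2 * sup_B |p(B) - q(B)|, by brute force over all subsets B."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for B in combinations(range(n), r):
            best = max(best, abs(sum(p[i] - q[i] for i in B)))
    return 2.0 * best

def tv_variation(p, q):
    """<p - q>(Omega): positive part plus negative part of p - q."""
    pos = sum(max(pi - qi, 0.0) for pi, qi in zip(p, q))
    neg = sum(max(qi - pi, 0.0) for pi, qi in zip(p, q))
    return pos + neg

def tv_density(p, q):
    """Integral of |dp/dq - 1| dq; assumes q[i] > 0 for every i."""
    return sum(abs(pi / qi - 1.0) * qi for pi, qi in zip(p, q))

# Hypothetical example distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_sup(p, q), tv_variation(p, q), tv_density(p, q))
```

Up to floating-point rounding, all three values agree (here they are approximately 0.6). The brute-force supremum is exponential in the size of the space and is only meant as a check on small examples.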



2 The f-divergences

Equation (5) can be read as follows:

$$\|\mu - \nu\| = \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu \tag{6}$$

with $f(x) = |x - 1|$. There is a way to generalise this approach by using other
forms of $f$. Let $f$ be a convex function on $\mathbb{R}_{\geq 0}$ that
vanishes at $x = 1$. Let $\mu, \nu$ be two probability measures with $\mu$ being
absolutely continuous with respect to $\nu$ (which will be written as
$\mu \ll \nu$). The f-divergence between $\mu$ and $\nu$ is given by

$$D_f(\mu, \nu) = \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu. \tag{7}$$

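If $\mu = \nu$, we have $\frac{d\mu}{d\nu} = 1$, so $D_f(\mu, \nu)$ vanishes in
this case. Furthermore, $D_f(\mu, \nu)$ is non-negative. Indeed, by Jensen's
inequality,

$$0 = f(1) = f\!\left(\int \frac{d\mu}{d\nu}\, d\nu\right) \leq \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu = D_f(\mu, \nu).$$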

Note though that $D_f(\mu, \nu)$ may be infinite. Furthermore, $D_f(\mu, \nu)$
may vanish even if $\mu \neq \nu$. To exclude this, further conditions on $f$
have to be imposed, for example as in the following

2.1. Lemma. Suppose there is an $a \in \mathbb{R}$ so that the function

$$g(x) := f(x) - a(x - 1)$$

is non-negative and vanishes only if $x = 1$. Then $D_f(\mu, \nu)$ vanishes only
if $\mu = \nu$.
Proof. The function $g(x)$ is convex as well. Furthermore,
$D_f(\mu, \nu) = D_g(\mu, \nu)$, since
$\int (\frac{d\mu}{d\nu} - 1)\, d\nu = 0$. But since $g$ is non-negative,
$D_g(\mu, \nu)$ can only vanish if $g(\frac{d\mu}{d\nu})$ is identical to zero,
which implies that $\frac{d\mu}{d\nu} = 1$ $\nu$-a.s. But this means
$\mu = \nu$. □
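The identity $D_f = D_g$ used in the proof can be checked numerically on a finite space. The sketch below takes $f(x) = x \log x$ with $a = 1$ as one possible choice satisfying the lemma; the helper `f_divergence` and the example vectors are illustrative assumptions.

```python
import math

def f_divergence(f, p, q):
    """D_f(p, q) = sum_i f(p_i / q_i) * q_i, assuming all p_i, q_i > 0."""
    return sum(f(pi / qi) * qi for pi, qi in zip(p, q))

def f(x):
    return x * math.log(x)

a = 1.0  # with this a, g is non-negative and vanishes only at x = 1

def g(x):
    return f(x) - a * (x - 1.0)

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
d_f = f_divergence(f, p, q)
d_g = f_divergence(g, p, q)
print(d_f, d_g)
```

The two values coincide up to rounding, because the linear term $a(x - 1)$ integrates to zero against $\nu$.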

The concept of f-divergences was introduced by Csiszár [1], who also noted
the result in Lemma 2.1. Common choices for $f$ are



f(x)                  Divergence                      Abbreviation
$(\sqrt{x} - 1)^2$    Hellinger divergence            HE
$|x - 1|$             total-variation divergence      TV
$x \log(x)$           Kullback-Leibler divergence     KL
$(x - 1)^2$           Pearson divergence              PE
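On a finite space the integral in Equation (7) becomes a sum, so the divergences listed above are easy to evaluate. The following is a minimal sketch; the dictionary `F`, the helper `f_divergence` and the example vectors are illustrative assumptions, and all weights of the second measure are assumed strictly positive.

```python
import math

# The four choices of f from the table, keyed by abbreviation.
F = {
    "HE": lambda x: (math.sqrt(x) - 1.0) ** 2,
    "TV": lambda x: abs(x - 1.0),
    "KL": lambda x: x * math.log(x) if x > 0 else 0.0,  # convention 0 log 0 = 0
    "PE": lambda x: (x - 1.0) ** 2,
}

def f_divergence(f, p, q):
    """D_f(p, q) = sum_i f(p_i / q_i) * q_i, assuming q_i > 0 for all i."""
    return sum(f(pi / qi) * qi for pi, qi in zip(p, q))

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
values = {name: f_divergence(f, p, q) for name, f in F.items()}
for name, value in values.items():
    print(name, value)
```

Each divergence is non-negative and vanishes when both arguments coincide, in line with the Jensen argument above.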

The transformation $f^*(x) = x f(1/x)$ yields a divergence $D_{f^*}$ which is
equal to $D_f$ but with interchanged arguments. Applying this transformation to
the Kullback-Leibler divergence, for example, we get a divergence which is also
sometimes referred to as the Kullback-Leibler divergence, or alternatively as
the Shannon divergence SH. The total variation divergence plays a central role,
since all f-divergences allow for an estimate against TV, as will be shown in
the following proposition, which forms the main result of this short note.
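The effect of the transformation $f^*(x) = x f(1/x)$ can be illustrated on a finite space. The sketch below assumes strictly positive weights for both measures; the helper names `f_divergence` and `conjugate` are hypothetical.

```python
import math

def f_divergence(f, p, q):
    """D_f(p, q) = sum_i f(p_i / q_i) * q_i, assuming all p_i, q_i > 0."""
    return sum(f(pi / qi) * qi for pi, qi in zip(p, q))

def conjugate(f):
    """Return f*(x) = x * f(1/x), defined here for x > 0 only."""
    return lambda x: x * f(1.0 / x)

def kl(x):
    return x * math.log(x)

sh = conjugate(kl)  # analytically equal to -log(x)

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
lhs = f_divergence(sh, p, q)  # SH(p, q)
rhs = f_divergence(kl, q, p)  # KL(q, p)
print(lhs, rhs)
```

Both numbers agree up to rounding, which is the interchange of arguments described above.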

2.2. Proposition. For two probability measures $\mu, \nu$, it holds in general that

$$f\!\left(1 + \tfrac{1}{2}\mathrm{TV}(\mu, \nu)\right) + f\!\left(1 - \tfrac{1}{2}\mathrm{TV}(\mu, \nu)\right) \leq D_f(\mu, \nu).$$

Proof. The proof of this fact is a generalisation of the method used in [3] to
prove the special case of the KL divergence. Since $f(1) = 0$, we have the
general property that

$$f(x) = f(\max\{x, 1\}) + f(\min\{x, 1\}).$$



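Using this fact and the convexity of $f$ (via Jensen's inequality) we get the
general estimate

$$\begin{aligned}
D_f(\mu, \nu) &= \int f\!\left(\max\left\{\tfrac{d\mu}{d\nu}, 1\right\}\right) d\nu + \int f\!\left(\min\left\{\tfrac{d\mu}{d\nu}, 1\right\}\right) d\nu \\
&\geq f\!\left(\int \max\left\{\tfrac{d\mu}{d\nu}, 1\right\} d\nu\right) + f\!\left(\int \min\left\{\tfrac{d\mu}{d\nu}, 1\right\} d\nu\right).
\end{aligned}$$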



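Now use that

$$\max\{x, 1\} = \frac{1 + x + |x - 1|}{2}, \qquad \min\{x, 1\} = \frac{1 + x - |x - 1|}{2},$$

so that, by Equation (5), the two integrals equal
$1 + \tfrac{1}{2}\mathrm{TV}(\mu, \nu)$ and
$1 - \tfrac{1}{2}\mathrm{TV}(\mu, \nu)$, respectively. This completes the
proof. □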
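The lower bound of Proposition 2.2 can be probed numerically on randomly drawn distributions over a finite set. This is only a sanity check under stated assumptions (strictly positive weights, a small rounding tolerance), not a substitute for the proof; all names below are illustrative.

```python
import math
import random

F = {
    "HE": lambda x: (math.sqrt(x) - 1.0) ** 2,
    "TV": lambda x: abs(x - 1.0),
    "KL": lambda x: x * math.log(x) if x > 0 else 0.0,  # convention 0 log 0 = 0
    "PE": lambda x: (x - 1.0) ** 2,
}

def f_divergence(f, p, q):
    """D_f(p, q) = sum_i f(p_i / q_i) * q_i, assuming q_i > 0 for all i."""
    return sum(f(pi / qi) * qi for pi, qi in zip(p, q))

def lower_bound(f, tv):
    """Left-hand side of Proposition 2.2: f(1 + TV/2) + f(1 - TV/2)."""
    return f(1.0 + tv / 2.0) + f(1.0 - tv / 2.0)

def random_distribution(n, rng):
    """Random probability vector with strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    tv = f_divergence(F["TV"], p, q)
    for f in F.values():
        # Small tolerance, since the bound holds with equality for f = |x - 1|.
        if lower_bound(f, tv) > f_divergence(f, p, q) + 1e-12:
            violations += 1
print("violations:", violations)
```

For $f(x) = |x - 1|$ the bound is attained with equality, which is why a tolerance is used in the comparison.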
Recalling that always $\mathrm{TV} \leq 2$, the proposition raises the question
as to when the function $f(1 + x) + f(1 - x)$ is monotonic on $x \in [0, 1]$.
The following lemma partially answers this.

2.3. Lemma. Under the conditions of Lemma 2.1 the function
$\phi(x) = f(1 + x) + f(1 - x)$ is strictly monotonic on $x \in [0, 1]$.

Proof. The conditions imply that $\phi(0) = 0$, $\phi(x) > 0$ for $x > 0$, and
that $\phi$ is convex. Let $0 < x_1 < x_2 \leq 1$. For any $\tau \in\, ]0, 1[$,

$$(1 - \tau)\phi(0) + \tau\phi(x_2) \geq \phi((1 - \tau) \cdot 0 + \tau x_2),$$

which obviously implies $\phi(x_2) > \tau\phi(x_2) \geq \phi(\tau x_2)$ (since
$\tau \in\, ]0, 1[$ and $\phi(x_2) > 0$). Now take $\tau = x_1/x_2$ to get the
result. □
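The monotonicity of $\phi$ can be inspected on a grid for the four common choices of $f$, each of which satisfies the condition of Lemma 2.1 (with $a = 1$ for KL and $a = 0$ for the others). The sketch below is illustrative only; the grid resolution and the names `phi` and `strictly_increasing` are assumptions.

```python
import math

F = {
    "HE": lambda x: (math.sqrt(x) - 1.0) ** 2,
    "TV": lambda x: abs(x - 1.0),
    "KL": lambda x: x * math.log(x) if x > 0 else 0.0,  # convention 0 log 0 = 0
    "PE": lambda x: (x - 1.0) ** 2,
}

def phi(f, x):
    """phi(x) = f(1 + x) + f(1 - x) for x in [0, 1]."""
    return f(1.0 + x) + f(1.0 - x)

def strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))

grid = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
results = {
    name: strictly_increasing([phi(f, x) for x in grid])
    for name, f in F.items()
}
print(results)
```

A grid check of this kind cannot prove strict monotonicity, but a failure would indicate a mistake in the chosen $f$ or in its implementation.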



As a corollary of Proposition 2.2 we get the following well known estimates
between TV and KL

2.4. Corollary (Bretagnolle-Huber and Furstemberg inequality).

$$\mathrm{TV}(\mu, \nu) \leq 2\sqrt{1 - \exp(-\mathrm{SH}(\mu, \nu))} \leq 2\sqrt{\mathrm{SH}(\mu, \nu)}$$
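The chain of inequalities in Corollary 2.4 follows from Proposition 2.2 with $f(x) = -\log(x)$, for which $f(1 + x) + f(1 - x) = -\log(1 - x^2)$, together with $1 - e^{-s} \leq s$. A numerical sanity check on a finite space might look as follows; the function names and the random test setup are illustrative assumptions.

```python
import math
import random

def total_variation(p, q):
    """TV(p, q) = sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def shannon(p, q):
    """SH(p, q) = sum_i -log(p_i / q_i) * q_i, assuming all p_i, q_i > 0."""
    return sum(-math.log(pi / qi) * qi for pi, qi in zip(p, q))

def random_distribution(n, rng):
    """Random probability vector with strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(1)
ok = True
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    tv = total_variation(p, q)
    sh = max(shannon(p, q), 0.0)  # guard against tiny negative rounding
    middle = 2.0 * math.sqrt(1.0 - math.exp(-sh))
    upper = 2.0 * math.sqrt(sh)
    ok = ok and tv <= middle + 1e-12 and middle <= upper + 1e-12
print(ok)
```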

Recall that $\mathrm{SH}(\mu, \nu) = \mathrm{KL}(\nu, \mu)$. A further useful
estimate concerns the Hellinger divergence

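2.5. Corollary. For the Hellinger divergence HE, the estimate

$$\mathrm{TV} \leq \begin{cases} 2 - 2\left(1 - \sqrt{\mathrm{HE}}\right)^2 & \text{if } \mathrm{HE} \leq 1 \\ 2 & \text{otherwise} \end{cases} \tag{8}$$

holds.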

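Proof. Proposition 2.2 gives the inequality

$$\mathrm{HE} \geq \left(\sqrt{1 + \tfrac{1}{2}\mathrm{TV}} - 1\right)^2 + \left(\sqrt{1 - \tfrac{1}{2}\mathrm{TV}} - 1\right)^2. \tag{9}$$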



The right hand side of Equation (9) is at least
$\left(\sqrt{1 - \tfrac{1}{2}\mathrm{TV}} - 1\right)^2$, whence

$$\mathrm{HE} \geq \left(\sqrt{1 - \tfrac{1}{2}\mathrm{TV}} - 1\right)^2,$$

which, after solving for TV, yields the result. □
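The estimate (8) can likewise be checked on finite spaces. The sketch below uses the identity $\mathrm{HE}(\mu, \nu) = \sum_i (\sqrt{\mu_i} - \sqrt{\nu_i})^2$, which follows from Equation (7) with $f(x) = (\sqrt{x} - 1)^2$; the function names and the random test setup are illustrative assumptions.

```python
import math
import random

def total_variation(p, q):
    """TV(p, q) = sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def hellinger(p, q):
    """HE(p, q) = sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def tv_bound(he):
    """Right-hand side of estimate (8)."""
    if he <= 1.0:
        return 2.0 - 2.0 * (1.0 - math.sqrt(he)) ** 2
    return 2.0

def random_distribution(n, rng):
    """Random probability vector with strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(2)
ok = True
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    ok = ok and total_variation(p, q) <= tv_bound(hellinger(p, q)) + 1e-12
print(ok)
```

The bound becomes trivial for $\mathrm{HE} \geq 1$, where it only restates $\mathrm{TV} \leq 2$.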